Textual Article Clustering in Newspaper Pages

Authors

  • Marco Aiello
  • Andrea Pegoretti
Abstract

In the analysis of a newspaper page, an important step is clustering the various text blocks into logical units, i.e., into articles. We propose three algorithms based on text processing techniques to cluster articles in newspaper pages. Based on the complexity of the three algorithms and on experimentation with actual pages from the Italian newspaper L’Adige, we select one of the algorithms as the preferred choice for solving the textual clustering problem.
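The abstract does not describe the three algorithms, so the following is only a generic sketch of what clustering text blocks by textual similarity can look like, not the authors' method. It assumes bag-of-words vectors, cosine similarity, a small illustrative stop-word list, and a hypothetical `threshold` parameter; blocks whose similarity reaches the threshold are merged into the same article with a union-find structure.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "for", "and", "in", "to", "on", "after", "will"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_blocks(blocks, threshold=0.25):
    """Group block indices whose pairwise similarity reaches the threshold.

    Similarity is treated as transitive: if A matches B and B matches C,
    all three end up in the same cluster (single-link behaviour).
    """
    vectors = [Counter(tokenize(b)) for b in blocks]
    parent = list(range(len(blocks)))

    def find(i):
        # Find the cluster representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(blocks)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    page = [
        "The city council approved the new budget for schools",
        "Council members said the school budget will rise",
        "The football team won the championship match",
        "Fans celebrated the team after the championship match",
    ]
    print(cluster_blocks(page))
```

A pairwise comparison like this costs quadratic time in the number of blocks, which is usually acceptable for a single page; the threshold would need tuning on real pages.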

Similar resources

Clustering in Newspaper Pages

In the analysis of a newspaper page an important step is the clustering of various text blocks into logical units, i.e., into articles. We propose three algorithms based on text processing techniques to cluster articles in newspaper pages. Based on the complexity of the three algorithms and experimentation on actual pages from the Italian newspaper L’Adige, we select one of the algorithms as th...

An Architecture for Efficient News Items Clustering and Retrieval Based on Language Models for a Dynamic Collection of E-Newspapers

Newspaper pages comprise multiple individual articles divided into multiple columns. The challenging part of this task is to organize and integrate article blocks in the newspaper. This paper proposes a novel approach for article reconstruction from newspapers, including an aggregation of multiple sections of an article and reading order recovery of each individual article. Thus, the process combi...

Linking article parts for the creation of newspaper digital library

An important issue pertaining to the retro-conversion of newspapers, i.e. the conversion of newspaper issues into digital resources, is the identification and appropriate digital representation of an article. To complete this task, a number of steps have to be followed, from segmentation of the newspaper image to optical character recognition and linking of different items belonging to the same...

Web pages, text types, and linguistic features: Some issues

With the growth of the Web a massive quantity of documents, namely web pages, are freely available for (corpus-)linguistic studies. Web pages can be considered as a new kind of document, much more unpredictable and individualized than paper documents. While the linear organization of most paper documents is still reflected in traditional electronic corpora, such as the British Na...

Metadiscourse Markers: A Contrastive Study of Translated and Non-Translated Persuasive Texts

Metadiscourse features are those facets of a text that make the organization of the text explicit, provide information about the writer's attitude toward the text content, and engage the reader in the interaction. This study interpreted metadiscourse markers in translated and non-translated persuasive texts. To this end, the researcher chose the translated versions of one of the leading newsp...

Journal:
  • Applied Artificial Intelligence

Volume 20

Publication date: 2006